Dataset: smashy_super

For the dataset smashy_super, the target is yval, a logloss performance measure. Values close to 0 mean good performance. First of all, we want to know which parameters are important in general.

Data Preparation

We need to load packages and subset the data to compare the whole dataset and the dataset with the 20% of configurations with the best outcome. In addition, the data must be manipulated to facilitate the use of the data for summaries and filters.

Load Data

library(VisHyp)
library(mlr3)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
smashy_super <- readRDS("D:/Simon/Desktop/Studium/6. Semester/Bachelorarbeit/package_VisHyp/data-raw/smashy_super.rds")
smashy_super <- as.data.frame(smashy_super)

# Convert logical and character columns to factors
for (i in seq_along(smashy_super)) {
  if (is.logical(smashy_super[, i]) || is.character(smashy_super[, i])) {
    smashy_super[, i] <- as.factor(smashy_super[, i])
  }
}

Create Task

superTask <- TaskRegr$new(id = "smashy_super", backend = smashy_super, target = "yval")
superBest <- smashy_super[smashy_super$yval >= quantile(smashy_super$yval, 0.8),]
superTaskBest <- TaskRegr$new(id = "taskBest", backend = superBest, target = "yval")
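The top-20% subset above keeps every row whose yval is at or above the 0.8 quantile, because yval is negative here and values closer to 0 are better. The following is a minimal sketch of that filter on made-up data; the data frame `toy` and its column `x` are hypothetical and only serve to show how many rows survive.

```r
# Hypothetical toy data, not the smashy_super data set
set.seed(1)
toy <- data.frame(
  x    = runif(200),
  yval = -runif(200, min = 0.21, max = 0.37)
)

# Larger yval (closer to 0) is better, so keep rows at or above the 0.8 quantile
cutoff  <- quantile(toy$yval, 0.8)
toyBest <- toy[toy$yval >= cutoff, ]

nrow(toyBest) / nrow(toy)  # share of rows kept, about one fifth
range(toyBest$yval)        # every kept value lies at or above the cutoff
```

Using `>=` rather than `<=` is the important design choice: with a loss that has been negated, the best configurations sit in the upper tail.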

Results

The target parameter yval can reach values between -0.3732 and -0.2105. Our goal is to obtain good results, i.e., to find configurations that produce values close to -0.2105.

The “random” samples perform better on average than the “bohb” samples. For the top 20% of configurations, many “bohb” samples have been filtered out, but the remaining ones have on average a better performance than the “random” samples. In the end, both samples can lead to good performance values, but since most of the remaining samples are “random”, we choose this value.

In general, for the parameter survival_fraction lower values perform better than higher values. Both subsets start with a low value and reach their maximum value directly afterwards. For the top configurations, higher values do not seem to be worse, so with good configurations of the other parameters the value of this parameter can also be high. Although not all high values have poor performance, lower values seem to be the right choice since most good configurations have lower values. A value between 0.05 and 0.30 seems to be a good choice for the “knn1” surrogate_learner.

The surrogate_learner parameter is one of the most important parameters for the whole dataset. After reducing the dataset to the best 20% of configurations, we could see that the parameter lost importance, since the remaining surrogate learners were mainly “knn1”. Even though we found that for all other surrogate learners the best configuration could achieve a better yval than the best “knn1” configuration, it makes sense to choose “knn1” because of its better results on average.

The most important parameter for the best 20% of the configurations was the random_interleave_fraction parameter. In this case, the results were unambiguous, so higher values led to better results for both the full data set and the subset. Another early indicator in the analysis was the summary of the full and split data sets. It could be seen that the summary indices for the subset were all higher. All effect tools such as the PDP, PCP, and Heatmap also showed these results. For our purpose, we only take values above 0.5, which is about half.

A similar problem occurs with budget_log_step. In the full dataset, higher values are better, but in the top 20% of configurations, lower values achieve better yval values. But unlike random_interleave_fraction, there are more configurations with good results in the split dataset. Also, it is a very important parameter for the top 20% configurations, so it should not be neglected that good performance values can be achieved with lower budget_log_step values. In this case it is better not to limit the parameter.

In the best parameter configurations in combination with “knn1” values of the surrogate_learner parameter, the filter_factor_first parameter was the most important parameter. In the full data set, this parameter was not important at all. There is also a difference in the range of good configurations. In the full dataset, values above 6 did not perform well, while in the subdivided dataset, values above 6 produced the best results. Even after subdividing into the best 20% of configurations, the majority of good values were above 4, so it can be said that values above 4 seem to be a good choice for this parameter.

A little more complicated was the interpretation of filter_factor_last. This parameter has large fluctuations and different good ranges depending on whether we look at the full or partial data set. Moreover, although the importance is high due to the large fluctuations, the range of predicted performances is not very large (which actually argues against the importance). In general, however, one can say that the value for filter_factor_last should be between 1.5 and 2.5, or above 5.5, or at least not between 4 and 5.

A really good parameter to interpret is filter_with_max_budget. This parameter is not really important in the full dataset, but for the best configurations in combination with “knn1” one can say that “TRUE” should be the choice.

filter_algorithm, filter_select_per_tournament and random_interleave_random have barely any effect and therefore do not need to be limited.

Data constraint to check the results

To verify the proposed parameter configurations, we constrain the dataset and compare the obtained performance with the ranks of the performance of the whole dataset.

final <- smashy_super[smashy_super$sample == "random",] 
final <- final[final$survival_fraction > 0.05 & final$survival_fraction < 0.3,] 
final <- final[final$surrogate_learner == "knn1",] 
final <- final[final$random_interleave_fraction > 0.5,]
final <- final[final$filter_factor_first > 4,]
final <- final[final$filter_factor_last < 4 | final$filter_factor_last > 5,]
final <- final[final$filter_with_max_budget == "TRUE",]

yval <- sort(final$yval, decreasing = TRUE)
yval_original <- sort(smashy_super$yval, decreasing = TRUE)
sort(match(yval, yval_original), decreasing = FALSE)
##  [1]   20   40   49   58   62   69   79  107  112  115  116  130  152  161  162
## [16]  178  182  184  189  206  208  218  238  241  242  264  274  276  277  280
## [31]  295  296  300  305  318  319  331  332  336  340  356  377  378  382  388
## [46]  393  404  432  434  442  446  450  452  486  489  490  501  509  513  534
## [61]  539  547  550  568  578  602  604  605  621  626  632  637  661  664  682
## [76]  722  731  737  744  754  774  794  818  823  838  839  853  920  922  971
## [91]  987  996 1152 1292 1468 1470

We can see that many good results were obtained, but not nearly all of the best configurations were found. This can be explained by the fact that we often imposed constraints to reduce the size of the data set. For example, for some categorical parameters, we always chose one factor level even though we knew that other categories could also yield good values. Furthermore, numerical parameters were partly restricted, although it was known that some very good yval values can also be obtained outside the chosen range. In the end, however, we were able to show that the ranges we restricted lead almost exclusively to above-average or good performance values.
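The rank list printed above can be summarised numerically. The sketch below copies the 96 ranks from that output and uses the 2845 observations reported by `str(smashy_super)`; the variable names are illustrative, and the two thresholds (top fifth and top half of all ranks) are one possible reading of "good" and "above average".

```r
# Ranks of the constrained configurations, copied from the output above
ranks <- c(
  20, 40, 49, 58, 62, 69, 79, 107, 112, 115, 116, 130, 152, 161, 162,
  178, 182, 184, 189, 206, 208, 218, 238, 241, 242, 264, 274, 276, 277, 280,
  295, 296, 300, 305, 318, 319, 331, 332, 336, 340, 356, 377, 378, 382, 388,
  393, 404, 432, 434, 442, 446, 450, 452, 486, 489, 490, 501, 509, 513, 534,
  539, 547, 550, 568, 578, 602, 604, 605, 621, 626, 632, 637, 661, 664, 682,
  722, 731, 737, 744, 754, 774, 794, 818, 823, 838, 839, 853, 920, 922, 971,
  987, 996, 1152, 1292, 1468, 1470
)
n_total <- 2845           # rows in the full data set
top20   <- n_total %/% 5  # 569 rows form the top 20%

sum(ranks <= top20)        # 64 of the 96 configurations are in the top 20%
sum(ranks <= n_total / 2)  # 94 of the 96 are in the better half
mean(ranks <= top20)       # share in the top 20%, about two thirds
```

About two thirds of the constrained configurations fall into the top 20%, and all but two rank in the better half of the full data set.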

Visual Overview

This can be checked visually with the implemented PCP. For a better overview, the color range is somewhat restricted, since there are very few observations below -0.3. For a better comparison, the presumed good range and the presumed worse configuration range of the parameters are shown once.

plotParallelCoordinate(superTask, labelangle = 10, colbarrange = c(-0.21, -0.3))

Limitation to very good configurations

knitr::include_graphics("Super_Best_PCP.png")

Limitation to bad configurations

knitr::include_graphics("Super_Bad_PCP.png")

Overview

An overview is obtained again.

Structure

str(smashy_super)
## 'data.frame':    2845 obs. of  12 variables:
##  $ budget_log_step             : num  0.1145 -0.4292 0.0482 -1.4432 0.3798 ...
##  $ survival_fraction           : num  0.261 0.3376 0.0149 0.5771 0.1676 ...
##  $ surrogate_learner           : Factor w/ 4 levels "bohblrn","knn1",..: 3 3 3 3 1 3 3 3 4 4 ...
##  $ filter_with_max_budget      : Factor w/ 2 levels "FALSE","TRUE": 1 2 2 2 1 1 2 2 1 2 ...
##  $ filter_factor_first         : num  0.23378 3.75637 1.00239 6.4045 0.00425 ...
##  $ random_interleave_fraction  : num  0.225 0.104 0.542 0.629 0.732 ...
##  $ random_interleave_random    : Factor w/ 2 levels "FALSE","TRUE": 2 2 1 2 2 2 1 1 2 1 ...
##  $ sample                      : Factor w/ 2 levels "bohb","random": 1 2 2 1 2 1 1 2 2 2 ...
##  $ filter_factor_last          : num  0.387 1.589 2.927 1.853 4.002 ...
##  $ filter_algorithm            : Factor w/ 2 levels "progressive",..: 1 1 1 2 2 2 1 1 2 2 ...
##  $ filter_select_per_tournament: num  2.27 2.3 1.93 1.77 2.28 ...
##  $ yval                        : num  -0.221 -0.216 -0.212 -0.212 -0.212 ...

We want to look at the importance for the whole dataset (general case) and for the best configurations (top 20%).

Importance General

plotImportance(task = superTask)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Importance Best

plotImportance(task = superTaskBest)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

For the full data set, surrogate_learner is the most important and sample the second most important hyperparameter. After filtering the dataset, both parameters lose much of their importance and have little effect, so random_interleave_fraction becomes the most important parameter. Parameters like filter_algorithm, random_interleave_random and filter_with_max_budget have no effect on either the full dataset or the filtered dataset.

After we have subdivided the data, we also want to look for structural changes in the summary.

Summary All

summary(smashy_super)
##  budget_log_step   survival_fraction   surrogate_learner filter_with_max_budget
##  Min.   :-1.7509   Min.   :0.0001849   bohblrn: 374      FALSE:1119            
##  1st Qu.:-0.8770   1st Qu.:0.1864801   knn1   :1658      TRUE :1726            
##  Median :-0.0860   Median :0.3550278   knn7   : 478                            
##  Mean   :-0.2054   Mean   :0.4194451   ranger : 335                            
##  3rd Qu.: 0.4727   3rd Qu.:0.6533882                                           
##  Max.   : 1.0186   Max.   :0.9999182                                           
##  filter_factor_first random_interleave_fraction random_interleave_random
##  Min.   :0.004248    Min.   :0.000615           FALSE:1624              
##  1st Qu.:2.454531    1st Qu.:0.308627           TRUE :1221              
##  Median :4.393864    Median :0.545574                                   
##  Mean   :4.066960    Mean   :0.536262                                   
##  3rd Qu.:5.794467    3rd Qu.:0.774285                                   
##  Max.   :6.906027    Max.   :0.999015                                   
##     sample     filter_factor_last    filter_algorithm
##  bohb  :1226   Min.   :0.004248   progressive: 909   
##  random:1619   1st Qu.:2.268931   tournament :1936   
##                Median :4.183293                      
##                Mean   :3.911979                      
##                3rd Qu.:5.670457                      
##                Max.   :6.906027                      
##  filter_select_per_tournament      yval        
##  Min.   :0.0009299            Min.   :-0.3732  
##  1st Qu.:1.0000000            1st Qu.:-0.2390  
##  Median :1.0000000            Median :-0.2331  
##  Mean   :1.0740216            Mean   :-0.2347  
##  3rd Qu.:1.0869452            3rd Qu.:-0.2278  
##  Max.   :2.3956034            Max.   :-0.2105

Summary Best 20%

summary(superBest)
##  budget_log_step    survival_fraction  surrogate_learner filter_with_max_budget
##  Min.   :-1.74596   Min.   :0.000291   bohblrn:  2       FALSE:127             
##  1st Qu.:-0.46235   1st Qu.:0.121852   knn1   :546       TRUE :442             
##  Median : 0.25398   Median :0.256286   knn7   : 19                             
##  Mean   : 0.04121   Mean   :0.320271   ranger :  2                             
##  3rd Qu.: 0.61932   3rd Qu.:0.433896                                           
##  Max.   : 1.01297   Max.   :0.992048                                           
##  filter_factor_first random_interleave_fraction random_interleave_random
##  Min.   :0.004248    Min.   :0.02443            FALSE:337               
##  1st Qu.:3.697472    1st Qu.:0.43278            TRUE :232               
##  Median :5.308223    Median :0.63116                                    
##  Mean   :4.710573    Mean   :0.61323                                    
##  3rd Qu.:6.174077    3rd Qu.:0.82455                                    
##  Max.   :6.899001    Max.   :0.98931                                    
##     sample    filter_factor_last    filter_algorithm
##  bohb  :156   Min.   :0.1005     progressive:202    
##  random:413   1st Qu.:2.7705     tournament :367    
##               Median :4.8008                        
##               Mean   :4.3414                        
##               3rd Qu.:6.0197                        
##               Max.   :6.8990                        
##  filter_select_per_tournament      yval        
##  Min.   :0.001125             Min.   :-0.2270  
##  1st Qu.:1.000000             1st Qu.:-0.2261  
##  Median :1.000000             Median :-0.2249  
##  Mean   :1.055841             Mean   :-0.2244  
##  3rd Qu.:1.000000             3rd Qu.:-0.2234  
##  Max.   :2.381424             Max.   :-0.2105

These summaries already explain why the parameter surrogate_learner lost most of its importance. Many “bohblrn”, “knn7” and “ranger” configurations were filtered out. This could mean that these learners perform worse on average than the “knn1” learner. For the parameter filter_with_max_budget, a disproportionate number of configurations with FALSE were filtered out. This could mean that TRUE values perform better on average. It is also notable that the summary values have decreased for survival_fraction and increased for budget_log_step, filter_factor_first and random_interleave_fraction. Finally, a disproportionate number of “bohb” samples also dropped out of the data set. Perhaps this is an indication that “random” samples gave better results.
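To make "disproportionate" concrete, the counts from the two summaries can be turned into retention rates, i.e. the share of each category that survives the top-20% filter. The counts below are copied from the summary output; the variable names are illustrative.

```r
# Category counts copied from summary(smashy_super) and summary(superBest)
learner_full <- c(bohblrn = 374, knn1 = 1658, knn7 = 478, ranger = 335)
learner_best <- c(bohblrn = 2,   knn1 = 546,  knn7 = 19,  ranger = 2)

sample_full <- c(bohb = 1226, random = 1619)
sample_best <- c(bohb = 156,  random = 413)

budget_full <- c("FALSE" = 1119, "TRUE" = 1726)
budget_best <- c("FALSE" = 127,  "TRUE" = 442)

# Share of each category that remains among the best 20% of configurations
round(learner_best / learner_full, 3)
round(sample_best / sample_full, 3)
round(budget_best / budget_full, 3)
```

If every category were filtered proportionally, each rate would be 0.2. Roughly a third of the “knn1” configurations remain, compared with less than 5% for each of the other learners, and the rates for “random” and TRUE are about twice those for “bohb” and FALSE.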

The hyperparameters will be examined more precisely in the following sections.

Examination of the parameters

sample

As we found out, “sample” is again an important parameter in the full dataset and can take the values “bohb” or “random”. This parameter should have the right value for good performance. Therefore, let us consider the effects of the parameter in a partial dependence plot. We also check if the effect applies to all parameters. We can use a heatmap to get a quick overview of interactions. Values close to 1 have barely any effect on the outcome.

PDP

plotPartialDependence(superTask, features = c("sample"), rug = FALSE, plotICE = FALSE)

Heatmap

subplot(
plotHeatmap(superTask, features = c("sample", "budget_log_step"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "survival_fraction"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "surrogate_learner"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "filter_with_max_budget"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "filter_factor_first"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "random_interleave_fraction"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "random_interleave_random"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "filter_factor_last"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "filter_algorithm"), rug = FALSE),
plotHeatmap(superTask, features = c("sample", "filter_select_per_tournament"), rug = FALSE),
nrows = 5,shareX = TRUE)

In the PDP, it can be seen that the target values for “random” samples lead to better results on average than for “bohb” samples. In the heatmaps, it can be seen that the predicted performances may be better when filter_with_max_budget is set to “TRUE”, random_interleave_fraction is given a high value and survival_fraction is given a low value. As suspected since the Summary, the surrogate_learner knn1 seems to give better results. This means that knn1 gives the best results on average.

Top 20%

We can split the data according to the best 20% of yval values of the dataset and check whether the outcome of a PDP is different.

plotPartialDependence(superTaskBest, features = c("sample"), rug = TRUE, plotICE = TRUE)

A lot of “bohb” samples were sorted out, but the remaining ones have on average a better performance than the “random” samples. Since both subsets seem important for further analysis, we split the entire dataset. Furthermore, we assume differences between “random” and “bohb” samples, since the parameter has lost much of its importance after filtering. Therefore we split the data set into “bohb” and “random” samples.

random <- smashy_super[smashy_super$sample == "random",]
bohb <- smashy_super[smashy_super$sample == "bohb",]

randomSubset <- TaskRegr$new(id = "task_random", backend = random, target = "yval")
bohbSubset <- TaskRegr$new(id = "task_bohb", backend = bohb, target = "yval")

Let’s check if there are differences in importance for the parameters in the random subset and the Bohb subset.

Subset bohb

plotImportance(task = bohbSubset)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Subset random

plotImportance(task = randomSubset)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

The hyperparameters surrogate_learner and random_interleave_fraction are still the most important parameters for both partial datasets. In fact, the importance did not change much.

There is little difference between the two samples in the full data set. We did find that the majority of the good results were obtained with the “random” samples, but for further analysis we will look at both the “random” subset and the “bohb” subset.

survival_fraction

The survival_fraction parameter was a moderately important parameter for both samples of the entire data set, but we assumed based on the summary that low values may lead to better performance. According to the summary, this parameter takes values between roughly 0.0002 and 1.0. Let us explore this assumption with a PDP.

Subset bohb

plotPartialDependence(bohbSubset, features = c("survival_fraction"), rug = TRUE, plotICE = FALSE) 

Subset random

plotPartialDependence(randomSubset, features = c("survival_fraction"), rug = TRUE, plotICE = FALSE)

In general, lower values perform better than higher values. Both subsets start with a low value and reach their maximum value directly afterwards. This means that the value should probably be low, but not minimal. For both subsets, the best range seems to be between 0.05 and 0.25. While the curve for the “random” samples is almost monotonically decreasing, the curve for the “bohb” samples has another peak between 0.5 and 0.75.

Top 20%

One way to analyze the structure is to filter the data again. For this, we can split the data according to the best 20% of yval values of the “bohb” samples. We can review the “bohb” samples with ICE curves. ICE curves can show the heterogeneous relationship between the parameter survival_fraction and the performance measure yval created by interactions.

bohbBest <- bohb[bohb$yval >= quantile(bohb$yval, 0.8),]
bohbBestTask <- TaskRegr$new(id = "bohbBestTask", backend = bohbBest, target = "yval")

randomBest <- random[random$yval >= quantile(random$yval, 0.8),]
randomBestTask <- TaskRegr$new(id = "randomBestTask", backend = randomBest, target = "yval")

bohb Best

plotPartialDependence(bohbBestTask, features = c("survival_fraction"), rug = TRUE, plotICE = TRUE)

random Best

plotPartialDependence(randomBestTask, features = c("survival_fraction"), rug = TRUE, plotICE = TRUE)

In this case, higher values do not seem to be worse. This is surprising, since in the general case low values were more important. It could mean that with good configurations of other parameters, the survival_fraction parameter even gives better results when a high value is chosen. This could also explain the increase in the range between 0.5 and 0.75 for the “bohb” sample. Looking at the rug, we see that most configurations were made below 0.5 and the fewest configurations were made above 0.75. Because of the few configurations with high values, the effect of good performances in this range is less strong. In the range between 0.5 and 0.75, there are more configurations, which therefore have a greater impact on the average curve. Although not all high values have poor performance, lower values seem to be the right choice since most good configurations have lower values.
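The relation between ICE curves and the averaged PDP curve discussed above can be illustrated by hand. The sketch below is an assumption-laden toy example: the data, the column names `x1` and `x2`, and the linear model are all hypothetical stand-ins, not the surrogate model that VisHyp fits internally. It shows how an interaction produces ICE curves that point in opposite directions while their average stays nearly flat.

```r
# Hypothetical toy data with an interaction: the effect of x1 flips with x2
set.seed(7)
n   <- 300
toy <- data.frame(
  x1 = runif(n),
  x2 = factor(sample(c("a", "b"), n, replace = TRUE))
)
toy$y <- ifelse(toy$x2 == "a", -toy$x1, toy$x1) + rnorm(n, sd = 0.05)

# Stand-in surrogate model (an assumption made for this sketch)
fit <- lm(y ~ x1 * x2, data = toy)

# ICE: for each grid value, set x1 to that value for every row and predict
grid <- seq(0, 1, length.out = 11)
ice  <- sapply(grid, function(g) {
  newdata    <- toy
  newdata$x1 <- g
  predict(fit, newdata = newdata)
})

# PDP: the average of all ICE curves at each grid value
pdp <- colMeans(ice)

dim(ice)    # one row per observation, one column per grid value
range(pdp)  # much narrower than the range of the individual curves
```

A flat or weak PDP can therefore hide strong but opposing individual effects, which is why the ICE curves are worth checking when the average curve looks surprising.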

surrogate_learner

A very important parameter for the bohb subset was the surrogate_learner. We can already assume that “knn1” is the most important surrogate_learner, since many other surrogate learners were filtered out in the top 20% dataset. But let’s check this with a PDP.

Subset bohb

plotPartialDependence(bohbSubset, features = c("surrogate_learner"), rug = FALSE, plotICE = FALSE)

Subset random

plotPartialDependence(randomSubset, features = c("surrogate_learner"), rug = FALSE, plotICE = FALSE)

In both subsets, knn1 is actually the best choice based on the PDP. There does not seem to be much difference between the other learners. For a more detailed analysis, we should split the data into the individual surrogate learners and see if there are differences in the importance of the other parameters. Although it would be interesting to analyze the learners for both samples separately, we focus on the whole dataset to make it less complicated and because the importance of the subsets does not differ much.

knn1Surrogate <- smashy_super[smashy_super$surrogate_learner == "knn1",] 
knn7Surrogate <- smashy_super[smashy_super$surrogate_learner == "knn7",] 
bohblrnSurrogate <- smashy_super[smashy_super$surrogate_learner == "bohblrn",]
rangerSurrogate <- smashy_super[smashy_super$surrogate_learner == "ranger",]

knn1Subset <- TaskRegr$new(id = "knn1Task", backend = knn1Surrogate, target = "yval")
knn7Subset <- TaskRegr$new(id = "knn7Task", backend = knn7Surrogate, target = "yval")
bohblrnSubset <- TaskRegr$new(id = "bohblrnTask", backend = bohblrnSurrogate, target = "yval")
rangerSubset <- TaskRegr$new(id = "rangerTask", backend = rangerSurrogate, target = "yval")

Subset: knn1

plotImportance(knn1Subset)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Subset: knn7

plotImportance(knn7Subset)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Subset: bohblrn

plotImportance(bohblrnSubset)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

Subset: ranger

plotImportance(rangerSubset)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

The parameters sample and random_interleave_fraction are the most important for “knn1”, “knn7” and “ranger”. For “bohblrn”, the parameter survival_fraction is more important than the parameter random_interleave_fraction. The parameter filter_with_max_budget has barely any effect for any learner except “knn1”. These are the parameters we should check more closely.

The most important parameter for nearly all surrogate learners is “sample”.

knn1: sample

plotPartialDependence(knn1Subset, "sample", rug = FALSE)

knn7: sample

plotPartialDependence(knn7Subset, "sample", rug = FALSE)

bohblrn: sample

plotPartialDependence(bohblrnSubset, "sample", rug = FALSE)

ranger: sample

plotPartialDependence(rangerSubset, "sample", rug = FALSE)

We already knew that “random” is better on average, but now we also know that this holds for all surrogate learners.

knn1: random_interleave_fraction

plotPartialDependence(knn1Subset, "random_interleave_fraction", plotICE = FALSE)

knn7: random_interleave_fraction

plotPartialDependence(knn7Subset, "random_interleave_fraction", plotICE = FALSE)

bohblrn: random_interleave_fraction

plotPartialDependence(bohblrnSubset, "random_interleave_fraction", plotICE = FALSE)

ranger: random_interleave_fraction

plotPartialDependence(rangerSubset, "random_interleave_fraction", plotICE = FALSE)

For the parameter random_interleave_fraction higher values always seem to be better. For “knn1” and “knn7”, low random_interleave_fraction values seem to have a stronger negative impact on the prediction than a low value for “ranger” or “bohblrn”. For the surrogate_learner “knn1” and “bohblrn”, the maximum results in slightly worse predicted performance, but since there are few instances, this is not certain. Values between 0.75 and 0.95 can be considered optimal values for the parameter.

Another important parameter for all surrogate learners is the survival_fraction parameter. Also, for “bohblrn” the parameter survival_fraction was noticeably more important than for the other learners. That is why we look at this parameter next.

knn1: survival_fraction

plotPartialDependence(knn1Subset, "survival_fraction")

knn7: survival_fraction

plotPartialDependence(knn7Subset, "survival_fraction")

bohblrn: survival_fraction

plotPartialDependence(bohblrnSubset, "survival_fraction")

ranger: survival_fraction

plotPartialDependence(rangerSubset, "survival_fraction")

Low values for survival_fraction are better in general for the learners “knn1” and “knn7”. For “knn1” a value close to 0 and for “knn7” a value between 0.05 and 0.15 should be considered. For “bohblrn” values around 0.25 to 0.35, and for “ranger” values around 0.15 to 0.25, seem to produce the best predicted performance.

The last parameter we want to check is filter_with_max_budget. It was only important for “knn1” and not important for the other learners.

knn1: filter_with_max_budget

plotPartialDependence(knn1Subset, "filter_with_max_budget")

knn7: filter_with_max_budget

plotPartialDependence(knn7Subset, "filter_with_max_budget")

bohblrn: filter_with_max_budget

plotPartialDependence(bohblrnSubset, "filter_with_max_budget")

ranger: filter_with_max_budget

plotPartialDependence(rangerSubset, "filter_with_max_budget")

When we compared the importance of the parameters across the surrogate learners, we found that the filter_with_max_budget parameter was only important for “knn1”. Here we can see that for “knn1” the parameter filter_with_max_budget should be set to “TRUE”. For the other learners it is indeed not important whether the parameter is set to “TRUE” or “FALSE”.

Top 20%

When we compared the summary of the full dataset with the top 20% of configurations, we could see that both “random” and “bohb” samples were left. We could also see that mostly “knn1” learners were left. To see whether it is still possible to obtain good results with the other learners, let’s look at the maximum values for all learners.

summary(superBest$surrogate_learner)
## bohblrn    knn1    knn7  ranger 
##       2     546      19       2
aggregate(x = superBest$yval,                
          by = list(superBest$surrogate_learner),              
          FUN = max) 
##   Group.1          x
## 1 bohblrn -0.2117795
## 2    knn1 -0.2170470
## 3    knn7 -0.2105208
## 4  ranger -0.2124898

It is interesting to see that the best configuration of each of the other learners, although these learners were filtered out in large numbers, achieves a better yval than the best “knn1” configuration. This is important because with this finding we know that it is indeed possible to achieve good results with all learners and not only with “knn1”. But “knn1” achieves the best results on average, which means that this learner is more robust: compared to the other learners, changes in configuration do not have such a large negative impact on performance.
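The distinction between the best single result and the average result per learner can be checked by computing both statistics side by side. The following sketch uses a small hypothetical data frame (`toy`, with made-up yval values) rather than the real data, just to show the pattern where one group has the best maximum but another has the better mean.

```r
# Hypothetical values: one learner is consistent, the other has one very
# good result and otherwise poor ones
toy <- data.frame(
  learner = c("knn1", "knn1", "knn1", "knn7", "knn7", "knn7"),
  yval    = c(-0.220, -0.221, -0.222, -0.211, -0.235, -0.240)
)

best_per_learner <- tapply(toy$yval, toy$learner, max)
mean_per_learner <- tapply(toy$yval, toy$learner, mean)

# One row per learner, with the best and the average yval
cbind(max = best_per_learner, mean = mean_per_learner)

names(which.max(best_per_learner))  # learner with the single best result
names(which.max(mean_per_learner))  # learner with the best average result
```

Applied to the real data, the same two calls on `smashy_super$yval` grouped by `smashy_super$surrogate_learner` would show whether the robustness argument for “knn1” holds on the full data set as well.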

We also want to investigate the best cases and, for this, directly check the subdivided datasets.

surrogate_learner knn1

Let’s investigate knn1 a bit more. Because we have less data, we can also make use of a parallel coordinate plot.

knn1Best <- bohbBest[bohbBest$surrogate_learner == "knn1",]

knn1BestTask <- TaskRegr$new(id = "task", backend = knn1Best, target = "yval")

PCP knn1

plotParallelCoordinate(knn1BestTask, labelangle = 10)

Importance Plot knn1

plotImportance(knn1BestTask)
## Scale for 'x' is already present. Adding another scale for 'x', which will
## replace the existing scale.

In the PCP it can be seen that the parameter filter_with_max_budget should be set to “TRUE”, random_interleave_random to “FALSE”, and random_interleave_fraction should be high for good results.

According to the importance plot, the parameters filter_factor_first and filter_factor_last are very important as well and should be examined further.

knn1: PDP filter_factor_first

plotPartialDependence(knn1BestTask, "filter_factor_first" )

knn1: PDP filter_factor_last

plotPartialDependence(knn1BestTask, "filter_factor_last")

In the PDPs we can see that filter_factor_first should be high and that filter_factor_last has the best outcome for values between 1.5 and 2.5 or above 6.

budget_log_step

Another very important parameter for random Subsets and for the filtered dataset is the budget_log_step parameter. First, let us investigate the parameter with a PDP for the full dataset.

Subset bohb

plotPartialDependence(bohbSubset, features = c("budget_log_step"), rug = FALSE, plotICE = FALSE)

Subset random

plotPartialDependence(randomSubset, features = c("budget_log_step"), rug = FALSE, plotICE = FALSE)

For the random subset, higher values produce better outcomes. For the bohb subset, there are two peaks around -0.5 and 0.5. To find reasons for the two peaks, let’s focus on the top 20% again.

Top 20%

bohb Best

plotPartialDependence(bohbBestTask, features = c("budget_log_step"), rug = TRUE, plotICE = TRUE)

random Best

plotPartialDependence(randomBestTask, features = c("budget_log_step"), rug = TRUE, plotICE = TRUE)

Similar to the survival_fraction parameter, configurations with a low value seem to have a positive rather than negative effect on performance if the other parameters are set correctly. This could be the reason why there are two peaks for the “bohb” sample.

If we look at low values only, we can see that the predicted performance varies a lot and that the configurations of the other parameters are responsible. We choose budget_log_step values below -1.4 to get fewer than 150 configurations.

budgetSubset <- random[random$budget_log_step < -1.4,]

budgetSubsetTask <- TaskRegr$new(id = "budgetSubsetTask", backend = budgetSubset, target = "yval")

plotParallelCoordinate(budgetSubsetTask, labelangle = 10)

In the PCP we can see that good values are often obtained with a “knn1” learner. A low survival_fraction is also important. The random_interleave_fraction parameter should be high instead.

PDP

Another possibility is to look at a two-dimensional partial dependence plot. We compare budget_log_step with the two parameters we found in the PCP.

survival_fraction

plotPartialDependence(randomSubset, features = c("budget_log_step", "survival_fraction"), rug = FALSE, gridsize = 10)

random_interleave_fraction

plotPartialDependence(randomSubset, features = c("budget_log_step", "random_interleave_fraction"), rug = FALSE, gridsize = 10)

We can see that high budget_log_step values lead to less poor performance even when the other parameters are poorly configured. Conversely, it is also possible to achieve good values when budget_log_step is low and the other parameters are well configured.
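A two-dimensional partial dependence surface can be sketched in base R as follows. The synthetic data frame `toy`, the assumed interaction in `yval` and the linear surrogate are illustrative assumptions, not the model used by plotPartialDependence.

```r
# Synthetic data (assumption) with an interaction between the two features.
set.seed(3)
toy <- data.frame(
  budget_log_step            = runif(300, -2, 2),
  survival_fraction          = runif(300, 0, 1),
  random_interleave_fraction = runif(300, 0, 1)
)
toy$yval <- -0.3 + 0.02 * toy$budget_log_step -
  0.04 * toy$survival_fraction * (toy$budget_log_step < 0) +
  0.01 * toy$random_interleave_fraction + rnorm(300, sd = 0.005)

fit <- lm(yval ~ budget_log_step * survival_fraction + random_interleave_fraction,
          data = toy)

gridsize <- 10
g1 <- seq(-2, 2, length.out = gridsize)  # budget_log_step grid
g2 <- seq(0, 1, length.out = gridsize)   # survival_fraction grid

# pd2[i, j]: mean prediction with both features fixed at (g1[i], g2[j]),
# averaged over the observed values of the remaining feature.
pd2 <- outer(seq_along(g1), seq_along(g2), Vectorize(function(i, j) {
  tmp <- toy
  tmp$budget_log_step   <- g1[i]
  tmp$survival_fraction <- g2[j]
  mean(predict(fit, newdata = tmp))
}))

image(g1, g2, pd2, xlab = "budget_log_step", ylab = "survival_fraction")
```

Reading the surface along one axis while holding the other fixed shows whether a low budget_log_step can be compensated by a good setting of the second parameter.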

random_interleave_fraction

The random_interleave_fraction parameter can vary between 0 and 1. It had a high importance in both subsets and was also the most important parameter for the 20% best configurations. Therefore, it is well worth checking this parameter.

bohb Subset

plotPartialDependence(bohbSubset, features = c("random_interleave_fraction"), rug = FALSE, plotICE = FALSE)

random Subset

plotPartialDependence(randomSubset, features = c("random_interleave_fraction"), rug = FALSE, plotICE = FALSE)

For the “bohb” samples, a high value is a good choice for random_interleave_fraction; a good range seems to be between 0.75 and 0.95. For the random samples, a value between 0.5 and 0.75 seems to produce the best performance.

bohb: top 20%

plotPartialDependence(bohbBestTask, features = c("random_interleave_fraction"), rug = FALSE, gridsize = 20)

random: top 20%

plotPartialDependence(randomBestTask, features = c("random_interleave_fraction"), rug = FALSE, gridsize = 20)

The filtered dataset shows that low values don’t have such a strong negative impact on the outcome, but high values are still better. A value above 0.5 should be chosen.

filter_factor_last

The parameter filter_factor_last was only moderately important, but a brief check is still worthwhile.

Bohb: full dataset

plotPartialDependence(bohbSubset, "filter_factor_last", plotICE = FALSE, gridsize = 40)

bohb: subdivided dataset

plotPartialDependence(bohbBestTask, features = c("filter_factor_last"), rug = TRUE, plotICE = FALSE, gridsize = 40)

random: full dataset

plotPartialDependence(randomSubset, "filter_factor_last", plotICE = FALSE, gridsize = 40)

random: subdivided dataset

plotPartialDependence(randomBestTask, features = c("filter_factor_last"), rug = TRUE, plotICE = FALSE, gridsize = 40)

The filter_factor_last parameter fluctuates a lot, and therefore we choose a higher gridsize. When the fluctuations increase, the importance increases as well, even if the range of predicted performances is not really big. The value for filter_factor_last should be between 1.5 and 2.5 or, alternatively, over 5.5 for bohb samples and between 5 and 5.5 for random samples.
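Why a finer grid helps with a fluctuating curve can be illustrated with a toy function. The function `effect` below is an invented wiggly curve, not the fitted effect of filter_factor_last; it only shows that a coarse grid can step over narrow peaks.

```r
# Assumed wiggly effect on the prediction (toy function, for illustration only).
effect <- function(x) 0.01 * sin(4 * x) - 0.3

coarse <- seq(1, 6, length.out = 10)  # few evaluation points
fine   <- seq(1, 6, length.out = 40)  # same range, four times as many points

# Largest value seen on each grid; the coarse grid misses the peaks here.
max_coarse <- max(effect(coarse))
max_fine   <- max(effect(fine))
c(coarse = max_coarse, fine = max_fine)

plot(fine, effect(fine), type = "l",
     xlab = "filter_factor_last", ylab = "toy prediction")
points(coarse, effect(coarse), pch = 19)
```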

filter_with_max_budget

Bohb: full dataset

plotPartialDependence(bohbSubset, "filter_with_max_budget", rug = FALSE)

bohb: subdivided dataset

plotPartialDependence(bohbBestTask, features = c("filter_with_max_budget"), rug = FALSE)

random: full dataset

plotPartialDependence(randomSubset, "filter_with_max_budget", rug = FALSE)

random: subdivided dataset

plotPartialDependence(randomBestTask, features = c("filter_with_max_budget"), rug = FALSE)

The parameter filter_with_max_budget has a weak effect but should be set to “TRUE”.

filter_select_per_tournament

This parameter had barely any effect in the general case but became a little more important in the top 20% of configurations. We check the partial dependence and the dependencies with the most important parameters to get more insight.

Bohb: full dataset

plotPartialDependence(bohbSubset, features = c("filter_select_per_tournament"), rug = FALSE, plotICE = FALSE)

Bohb: subdivided dataset

plotPartialDependence(bohbBestTask, features = c("filter_select_per_tournament"), rug = FALSE, plotICE = FALSE)

random: full dataset

plotPartialDependence(randomSubset, features = c("filter_select_per_tournament"), rug = FALSE, plotICE = FALSE)

random: subdivided dataset

plotPartialDependence(randomBestTask, features = c("filter_select_per_tournament"), rug = FALSE, plotICE = FALSE)

The effect is weak and may come from the peaks around 1 to 1.3. The parameter should probably be set to 1 or slightly higher, but the choice shouldn’t affect performance much.

filter_factor_first

This parameter had barely any effect in the general case but became a little more important in the top 20% of configurations. We check the partial dependence and the dependencies with the most important parameters to get more insight.

Bohb: full dataset

plotPartialDependence(bohbSubset, features = c("filter_factor_first"), rug = FALSE, plotICE = FALSE)

Bohb: subdivided dataset

plotPartialDependence(bohbBestTask, features = c("filter_factor_first"), rug = TRUE, plotICE = FALSE)

random: full dataset

plotPartialDependence(randomSubset, features = c("filter_factor_first"), rug = FALSE, plotICE = FALSE)

random: subdivided dataset

plotPartialDependence(randomBestTask, features = c("filter_factor_first"), rug = TRUE, plotICE = FALSE)

The parameter filter_factor_first shows interesting differences between the general and the subdivided case. While in the general case the predicted performance decreases a lot for values above 6, in the subset these values show the best performance. Since the majority of good cases in the subset are in this area, it seems to be a good choice to pick a value over 6.
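A claim about where the good cases lie can be checked by comparing the share of configurations above the threshold in the full data and in the top 20%. The sketch below uses a synthetic data frame `toy` (an assumption, with no real effect built in) and mirrors the quantile filter used for superBest.

```r
# Synthetic data (assumption): yval is pure noise here, so no effect is implied.
set.seed(4)
toy <- data.frame(filter_factor_first = runif(500, 1, 8))
toy$yval <- -0.3 + rnorm(500, sd = 0.02)

# Top 20% by yval, using the same quantile filter as for superBest.
best <- toy[toy$yval >= quantile(toy$yval, 0.8), ]

# Share of configurations with filter_factor_first above 6 in each set.
share_all  <- mean(toy$filter_factor_first > 6)
share_best <- mean(best$filter_factor_first > 6)
c(all = share_all, best = share_best)
```

On the real data, a clearly larger share in the top 20% than in the full dataset would support picking a value over 6.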